Abstract

Both parametric and non-parametric approaches have demonstrated encouraging performance in the human parsing task, namely segmenting a human image into several semantic regions (e.g., hat, bag, left arm, face). In this work, we aim to develop a new solution with the advantages of both methodologies, namely supervision from annotated data and the flexibility to use newly annotated (possibly uncommon) images, and present a quasi-parametric human parsing model. Under the classic K Nearest Neighbor (KNN)-based non-parametric framework, the parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict the matching confidence and displacements of the best matched region in the testing image for a particular semantic region in one KNN image. Given a testing image, we first retrieve its KNN images from the annotated/manually-parsed human image corpus. Then each semantic region in each KNN image is matched with confidence to the testing image using M-CNN, and the matched regions from all KNN images are further fused, followed by a superpixel smoothing procedure to obtain the final human parsing result. The M-CNN differs from the classic CNN in that tailored cross image matching filters are introduced to characterize the matching between the testing image and the semantic region of a KNN image. The cross image matching filters are defined at different convolutional layers, each aiming to capture a particular range of displacements. Comprehensive evaluations over a large dataset with 7,700 annotated human images demonstrate the significant performance gain of the quasi-parametric model over state-of-the-art methods [29] for the human parsing task.

1. Introduction 

Human parsing, namely partitioning the human body into several semantic regions (e.g., hat, left/right leg, glasses and upper-body clothes), has drawn much attention in recent years and serves as the basis for many high-level applications, such as clothing classification and retrieval.

Figure 1. Illustration of our quasi-parametric human parsing framework. Given a testing image, its K Nearest Neighbour (KNN) images are retrieved from the manually-annotated image corpus. Then the input image is paired with each semantic region (e.g., hat, skirt and pants) of its KNN images and each pair is fed into the M-CNN individually. The M-CNN predicts the matching confidence and displacements between the input image pair. Then the corresponding label maps are transferred from the KNN region to the testing image. All transferred label maps from different KNN images are combined to produce a probability map for each label. The probability map is further refined by superpixel smoothing to generate the final parsing result. The color legend is shown at the top. For better viewing of all figures in this paper, please refer to the original zoomed-in color pdf file.

Several parametric and non-parametric human parsing methods have been proposed and show very promising performance on the human parsing task. On one hand, parametric methods learn knowledge, such as the appearance of regions of different semantic labels and the structural relationship among different labels, from annotated images. Those methods usually rely on manually designed structural models, which may not fit specific data well and thus achieve only suboptimal performance. Moreover, for another set of new training data and semantic labels, new models have to be designed or retrained, which makes those parametric models impractical because new clothing styles may come out quite often. On the other hand, non-parametric methods can flexibly use newly annotated images on the fly and address the issues of parametric models, making them more appealing for practical applications. Such methods usually first build the pixel-level, superpixel-level or hypothesis-level [26, 13] matching between a testing image and the annotated images in a corpus, then transfer the labels from the manually annotated images to the testing image based on the matching outputs, and finally fuse the transferred labels by heuristic aggregation schemes (typically majority voting). However, the quality of matching is usually limited by the lack of explicit semantic meaning of the bottom-up superpixels or hypotheses.

Figure 2. The architecture of the proposed Matching Convolutional Neural Network (M-CNN) with parameters shown. The input image is cropped from a human image by human detection and is paired with the semantic region (e.g., skirt). The image pair is fed into the M-CNN to estimate the matching outputs, including the 1-dim matching confidence (red square) and 4-dim displacements (blue squares). The M-CNN is composed of two kinds of paths. In the single image convolutional paths (top and bottom rows), the input and KNN region are independently convolved with single image filters (green dashed lines) for hierarchical feature representation. In the cross image convolutional path (middle row), the cross image matching filters (purple dashed lines) are embedded in the Conv2, Conv3, Conv4 and Conv5 layers to achieve multi-ranged matching. Each matching kernel convolves with all feature maps of the previous convolutional layers. The feature maps of the three paths are fused to fit the network output.

The above-mentioned parametric and non-parametric human parsing methods rely on hand-designed pipelines composed of multiple sequential components, e.g., hand-crafted feature extraction, bottom-up over-segmentation, human pose estimation and manually designed complex model structures. Therefore, the poor performance of any single component may become the bottleneck of the whole pipeline. For example, human pose estimation, an important component in the above pipelines, is itself quite a challenging task. Such a sequential processing strategy usually makes the whole pipeline suboptimal. Instead of a combination of multiple sequential steps, several Convolutional Neural Network (CNN) based methods have been proposed for end-to-end image parsing. However, these deep models cannot be easily updated to incorporate new semantic labels.

To address these issues, we propose a quasi-parametric human parsing framework, which inherits the merits of both the parametric and non-parametric models. The proposed end-to-end framework is able to take full advantage of the supervision information from annotated training data, and meanwhile is easy to extend to newly added labels. The core part of the proposed framework is a specially designed Matching Convolutional Neural Network (M-CNN) that matches any semantic region of a KNN image (also denoted as a KNN region in this paper) to the testing image.

As shown in Fig. 1, we first apply human detection to a testing image and obtain the human centric image. Then the K Nearest Neighbour (KNN) images of the test image are retrieved from the annotated/manually-parsed image corpus. Each KNN image derives several semantic regions, which are generated by masking out the background region with the mean image of the image corpus. Three KNN regions for hat, skirt and pants are shown in Fig. 1. Then, the pair of the test image and each of the KNN regions is fed into the proposed M-CNN to estimate their matching confidence and displacements. The matching confidence measures how well the KNN region matches the input image, while the displacements describe the coordinate translations between the KNN region and the matched region in the testing image. The matching confidences are then averaged over all KNN regions and thresholded to predict whether a specific label is present. For the labels predicted to be present, such as hat and skirt in Fig. 1, the corresponding label maps can be transferred from the KNN regions to the testing image based on the estimated displacements. For the labels predicted as invisible, such as pants in Fig. 1, no transferred label map is generated. Then, all matched regions of a specific label are combined to produce a probability map. Finally, the probability maps for all labels are refined by a superpixel smoothing step to get the final parsing result.

Reliable matching between an input image and a KNN region is challenging, because the matching needs to handle the large spatial variance of semantic regions. For example, bags can be placed on the left, on the right or in front of the human body. The proposed M-CNN is able to achieve accurate multi-ranged matching. As shown in Fig. 2, M-CNN contains three paths, i.e., two single image convolutional paths and a cross image convolutional path. The single image convolutional path receives the input image or a particular KNN region, and produces its discriminative hierarchical feature representations layer by layer. The cross image convolutional path embeds cross image filters into every convolutional layer to characterize the multi-ranged matching. The cross image filters are applied to all feature maps in the previous convolutional layer, including the single image feature maps and cross image feature maps. Because the scale of the receptive fields of the feature maps increases when tracing up the M-CNN, the cross image matching filters capture displacements from the near range to the far range. Therefore the feature maps from the cross image convolutional path can represent the displacements well. Because the feature maps generated by the two single image convolutional paths are strong feature representations, their absolute difference maps are calculated as another measurement of the displacements. The difference maps are combined with the cross image feature maps and then linked to the subsequent fully connected layers. Finally, the matching confidence and displacements are regressed. Since the M-CNN targets matching an input image and any KNN region of any semantic label, it can work even if new semantic labels are included. Instead of training an M-CNN for each label, we train a unified M-CNN for all KNN regions of all labels.

Comprehensive evaluations over a large dataset with 7,700 annotated human images demonstrate the effectiveness of our quasi-parametric framework. The major contributions are summarized as follows:

• We build a novel deep quasi-parametric human parsing framework. It can learn from annotated data and also flexibly use newly annotated (possibly uncommon) images.

• We propose a Matching Convolutional Neural Network (M-CNN) to match a semantic region of a KNN image to a testing image. The novel cross image filters are embedded into different convolutional layers, each aiming to capture a particular range of displacements.

• We integrate all the step-by-step components (over-segmentation, pose estimation, feature extraction, label modeling, etc.) in traditional pipelines into one unified end-to-end deep CNN framework.

2. Related Work 

In this section, we sequentially review the parametric 
human parsing methods, non-parametric methods and deep 
learning based methods. 

For parametric human parsing, Yamaguchi et al. [29] proposed to boost human parsing with pixel-level classification by relying on human pose estimation. To capture more complex contextual information, Dong et al. designed an And-Or Graph structure to model the correlations of a group of parselets, and their extension work unified human parsing and pose estimation in one framework. Image co-segmentation and region co-labeling for human parsing were also used to capture the correlations between different human images [30]. In addition, Liu et al. utilized user-generated category tags to build a human parser. For general image parsing, Tighe et al. proposed a segmentation-by-detection approach: the bounding boxes of the objects are first estimated by exemplar SVMs, based on which the segmentation masks are transferred from the image corpus to the input image. In general, the power of existing parametric methods is largely limited by the suboptimal performance of many hand-designed intermediate components, such as pose estimation, and these methods cannot be easily extended to parse new labels.

In non-parametric human parsing, pixels, superpixels [24] and object proposals [26] were used to facilitate non-parametric image parsing. Specifically, the model of Yamaguchi et al. transferred parsing masks from retrieved examples to the query image. Their label transfer is based on superpixels, which are generated by over-segmentation based on low-level appearance cues and therefore lack semantic meaning. Liu et al. used SIFT Flow to build the pixel-to-pixel correspondence and the dense deformation field between images. However, the optimization problem for finding the SIFT flow is rather complex and expensive to solve. Recently, Long et al. [20] demonstrated the better performance of convolutional activation features over traditional features, such as SIFT, for tasks requiring correspondence. Overall, the non-parametric methods are limited by inaccurate matching, which results in noise and outliers during label transfer.

Our quasi-parametric model integrates the advantages of both parametric models and non-parametric models through the proposed M-CNN. There exist some works on semantic segmentation with CNN architectures. Girshick et al. and an extension of their work proposed to classify candidate regions with a CNN for semantic segmentation. Wang et al. presented a joint task learning framework, in which the object localization task and the object segmentation task are tackled collaboratively via a CNN. Farabet et al. trained a multi-scale CNN from raw pixels to extract deep features for assigning a label to each pixel. The recurrent CNN was proposed to speed up scene parsing and achieved state-of-the-art performance. Our M-CNN inherits the merit of existing CNN parsing models in its single image convolutional path. It differs from all existing CNN based parsing models in that we handle a pair of images instead of a single image and we incorporate cross image filters to specifically characterize multi-ranged matching. Last but not least, M-CNN can handle new semantic labels without being redesigned.


3. Quasi-parametric Human Parsing 

For each human image, we first retrieve its KNN images from the annotated image corpus (Sec. 3.1). Then, M-CNN predicts the matching confidence and displacements between the input image and a semantic region from one KNN image, based on which a label map is generated (Sec. 3.2). Finally, all label maps are fed into a post processing procedure to produce the parsing result (Sec. 3.3).
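The retrieval step that opens this pipeline can be illustrated with a minimal sketch. It assumes that every image is already described by a fixed-length global feature vector (Sec. 3.1 uses a 4,096-dimensional CNN feature) and that neighbours are ranked by Euclidean distance; the distance measure and the function name `retrieve_knn` are assumptions made for illustration, not details given in the text.

```python
import numpy as np

def retrieve_knn(query_feat, corpus_feats, k=9):
    """Return the indices of the k corpus images closest to the query.

    query_feat:   (D,) global feature of the human centric test image.
    corpus_feats: (N, D) global features of the annotated corpus images.
    Euclidean distance is an assumption; the text only says that retrieval
    is based on the deep features.
    """
    dists = np.linalg.norm(corpus_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

# Toy example with 2-D features instead of 4,096-D ones.
corpus = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
neighbours = retrieve_knn(np.array([0.9, 0.0]), corpus, k=2)
```

For a large corpus an approximate nearest-neighbour index would normally replace the exhaustive distance computation, but the ranking logic stays the same.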

3.1. K-Nearest-Neighbor Retrieval

For each of the input images, we use a human detection algorithm to detect the human body. The resulting human centric image $I$ is then rescaled to 227 x 227 x 3. We then extract a global 4,096-dimensional feature from the penultimate fully-connected layer of a CNN model pre-trained on the ILSVRC 2012 classification dataset and based on the Krizhevsky architecture. Its KNN images $G = \{g_1, g_2, \ldots, g_K\}$ are retrieved from the image corpus based on the deep features.

3.2. Matching Convolutional Neural Network

Figure 3. Illustration of the multi-range matching. For a specific label, e.g., bag, the matchings between the input image and KNN regions may be near/middle/far-ranged. To facilitate the display, we only keep the bag regions unchanged and gray out other pixels. (Panels: input image, near-ranged, middle-ranged, far-ranged.)

Input, Output and Loss Function: Given a label $l \in \{1, \ldots, L\}$, where $L$ is the total number of labels, the input image $I$ and each KNN region $g_{kl}$ from KNN image $g_k$ form a pair and are fed into the M-CNN. Note that $g_k$ is from the image corpus and thus its label map is known. $g_{kl}$ is generated by keeping the regions of the label $l$ in $g_k$ and masking out other regions with the mean image calculated from the image corpus. If $g_k$ does not contain label $l$, $g_{kl}$ is exactly the mean image.

Given an image pair $(I, g_{kl})$, M-CNN learns a regressor to estimate the matching confidence and displacements between them. The 1-dim matching confidence $c_{kl}$ indicates how well $g_{kl}$ can be matched to $I$. In the training phase, it is a binary index indicating whether the KNN region is matched to the input image: $g_{kl}$ is considered to be matched with $I$ if and only if $I$ contains the label $l$. In the testing phase, a higher value of $c_{kl}$ indicates that a better matching is found. We denote the coordinates of the upper left and lower right corners of the KNN region in $g_{kl}$ as $u_g = [x_g^1, y_g^1]$ and $w_g = [x_g^2, y_g^2]$. Similarly, the coordinates of the matched region in $I$ are denoted as $u_I = [x_I^1, y_I^1]$ and $w_I = [x_I^2, y_I^2]$. All coordinates are normalized by the height and width of the image into the range [0,1]. The 4-dim displacements $t_{kl}$ represent the differences between $u_g$ and $u_I$, and between $w_g$ and $w_I$. Instead of using a classification loss, we train the M-CNN by minimizing the $\ell_2$ distance between the ground truth $[c_{kl}, t_{kl}]$ and the prediction $[\hat{c}_{kl}, \hat{t}_{kl}]$. The corresponding $\ell_2$ loss for each image $I$ is defined as

$$J_I = \frac{1}{L \times K} \sum_{k=1}^{K} \sum_{l=1}^{L} \| c_{kl} - \hat{c}_{kl} \|^2 + \frac{1}{L \times K} \sum_{k=1}^{K} \sum_{l \in \phi(I) \cap \phi(g_k)} \| t_{kl} - \hat{t}_{kl} \|^2, \qquad (1)$$

where $\phi(\cdot)$ is the label set contained in the specific image. The first term in Eq. (1) is the loss for the matching confidence while the second term corresponds to the loss for the displacements. We penalize the displacement loss only when both the KNN image $g_k$ and the input image $I$ contain the label $l$. Then the losses of all training pairs are summed and the parameters are learned by back propagation.
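The regression loss used to train M-CNN can be sketched in a few lines of NumPy. The sketch follows the verbal description of the loss: a squared error on the matching confidence for every (KNN image, label) pair, plus a squared error on the 4-dim displacements that is counted only when both the input image and the KNN image contain the label. The array layout, the boolean mask `both_have_label`, and the use of the same 1/(L x K) normalization for both terms are assumptions made for illustration.

```python
import numpy as np

def matching_loss(c_true, c_pred, t_true, t_pred, both_have_label):
    """Per-image l2 matching loss (illustrative sketch).

    c_true, c_pred:   (K, L) ground-truth and predicted matching confidences.
    t_true, t_pred:   (K, L, 4) ground-truth and predicted displacements.
    both_have_label:  (K, L) boolean mask, True where label l is contained
                      in both the input image and the k-th KNN image.
    """
    K, L = c_true.shape
    # Confidence term: every (k, l) pair contributes.
    conf_term = np.sum((c_true - c_pred) ** 2) / (L * K)
    # Displacement term: only pairs where both images contain the label.
    sq_disp = np.sum((t_true - t_pred) ** 2, axis=-1)
    disp_term = np.sum(sq_disp * both_have_label) / (L * K)
    return conf_term + disp_term
```

In an actual training setup this quantity would be computed inside the deep learning framework so that gradients can be back-propagated; the NumPy version only makes the masking of the displacement term explicit.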


Architecture: Since the KNN images are retrieved based on global appearance similarity, the KNN region of each label may be located quite differently in the images. For example, in Fig. 3, the bags can be placed on the left side, on the right side or in front of the human body. M-CNN is designed to estimate the multi-ranged matching by embedding the cross image matching filters in different convolutional layers. As shown in Fig. 2, M-CNN contains two kinds of paths, i.e., two single image convolutional paths and one cross image convolutional path. The outputs of these three paths are further fused to estimate the matching confidence and displacements. The single image path aims for hierarchical feature representation while the cross image convolutional path estimates the displacements between the input pair.

Single Image Convolutional Path: We have two instantiations of the single image path in the top and bottom rows of Fig. 2, each of which separately processes $I$ or $g_{kl}$. They share the same architecture and extract the hierarchical feature representations of $I$ or $g_{kl}$. The outputs are their respective feature maps in “conv5”. In this path, the single image filters of the next convolutional layer are connected to the feature maps in the previous layer, shown as the green dashed lines in Fig. 2. The ReLU non-linearity is applied to the output of every convolutional layer. The sizes of the feature maps are gradually reduced by using a stride of 2 for all the convolutional layers. The most important difference between M-CNN and the classic CNN architecture is that M-CNN removes the pooling layers. Although pooling is useful for enhancing translation invariance for object recognition, it loses precise spatial information that is necessary for accurately predicting the locations of the labels. The details of the network parameters, such as image/feature map sizes and kernel sizes/numbers, are shown in Fig. 2. The powerful representation capability of the single image convolutional path lays the foundation for the accurate estimation of the matching confidence and displacements.

Cross Image Convolutional Path: The cross image convolutional path lies in the middle row of Fig. 2. It outputs the cross image feature maps in the “conv5” layer. The $m$-th cross image feature map $x^{c}_{j,m}$ in the $j$-th layer is generated by convolving the corresponding matching filter (including three components $f^{I}_{j,p,m}$, $f^{g}_{j,q,m}$ and $f^{c}_{j,t,m}$) with both single image and cross image feature maps in the $(j-1)$-th layer. The $f^{I}_{j,p,m}$ component links the $p$-th (out of all $P$) input image feature map $x^{I}_{j-1,p}$ in the $(j-1)$-th layer to $x^{c}_{j,m}$. Analogously, the $f^{g}_{j,q,m}$ component links the $q$-th (out of all $Q$) KNN region feature map $x^{g}_{j-1,q}$ to $x^{c}_{j,m}$. Moreover, the $f^{c}_{j,t,m}$ component links the $t$-th (out of all $T$) cross image feature map $x^{c}_{j-1,t}$ to $x^{c}_{j,m}$. The operation of the matching filters from one layer to the next is shown as the purple dashed lines in Fig. 2. Mathematically, the cross feature map is calculated by:

$$x^{c}_{j,m} = \max\Big(0,\; b_{j,m} + \sum_{p=1}^{P} f^{I}_{j,p,m} * x^{I}_{j-1,p} + \sum_{q=1}^{Q} f^{g}_{j,q,m} * x^{g}_{j-1,q} + \sum_{t=1}^{T} f^{c}_{j,t,m} * x^{c}_{j-1,t}\Big), \qquad (2)$$

where $*$ denotes convolution and $b_{j,m}$ is the bias for the $m$-th output map. $\max(0, \cdot)$ is the non-linear activation function, applied element-wise. From Eq. (2), we can see that the cross image feature map in the next layer is calculated by considering both single and cross image feature maps in the previous layer, and thus the displacements between the input image $I$ and KNN region $g_{kl}$ can be effectively estimated. Note that, going up the M-CNN, the receptive fields of the successive layers of the single image and cross image convolutional paths gradually increase. In this way, multi-ranged matchings can be achieved.
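The computation of one cross image feature map can be made concrete with a small NumPy sketch. It sums, over the input image maps, the KNN region maps and the cross image maps of the previous layer, the response of one filter component per map, adds a bias and applies the ReLU. The helper names are hypothetical, the sliding-window operation is written as cross-correlation (the usual convention for "convolution" in CNN implementations), and stride 1 is used for brevity although the network described above uses a stride of 2.

```python
import numpy as np

def conv2d_valid(x, f):
    """Slide kernel f (kh, kw) over map x (H, W) without padding (stride 1)."""
    kh, kw = f.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * f)
    return out

def cross_feature_map(x_img, x_reg, x_cross, f_img, f_reg, f_cross, bias):
    """One output cross image feature map of a layer.

    x_img, x_reg, x_cross: lists of previous-layer maps from the input image
        path, the KNN region path and the cross image path (the last list may
        be empty for the first layer that contains cross image filters).
    f_img, f_reg, f_cross: lists of kernels, one per corresponding map.
    """
    total = bias
    for maps, filters in ((x_img, f_img), (x_reg, f_reg), (x_cross, f_cross)):
        for x, f in zip(maps, filters):
            total = total + conv2d_valid(x, f)
    return np.maximum(0.0, total)  # element-wise ReLU
```

A full layer would repeat this for every output index and run on a GPU framework; the loop form is meant only to show which previous-layer maps feed each cross image map.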

Finally, we fuse the feature maps from the two single image paths and the cross image path. More specifically, the absolute differences of the feature maps of the input image and the KNN region (from the single image convolutional paths) are first calculated and then stacked with the output of the cross image convolutional path. Our fusion is applied on the feature maps. It is different from the “Siamese” architecture, which calculates the absolute differences of the fully-connected representations. Experiments show that our fusion strategy outperforms “Siamese” by keeping more spatial information.
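As a minimal sketch of this fusion step, the snippet below takes the final feature maps of the two single image paths and of the cross image path, computes the element-wise absolute difference of the first two, and stacks the result with the cross image maps along the channel axis. The (channels, height, width) layout and the function name `fuse_paths` are assumptions made for illustration.

```python
import numpy as np

def fuse_paths(feat_img, feat_reg, feat_cross):
    """Fuse the three paths at the feature-map level.

    feat_img, feat_reg: (C, H, W) maps from the two single image paths.
    feat_cross:         (C2, H, W) maps from the cross image path.
    Returns a (C + C2, H, W) tensor that keeps the spatial layout, which is
    what distinguishes this fusion from differencing fully-connected vectors.
    """
    diff = np.abs(feat_img - feat_reg)
    return np.concatenate([diff, feat_cross], axis=0)
```

The fused tensor would then be passed to the fully connected layers that regress the matching confidence and displacements.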

3.3. Post Processing 

Given the matching confidences and displacements estimated by M-CNN, the parsing result of the input image is calculated as follows. Firstly, the confidence of $I$ containing the $l$-th label is calculated by averaging the matching confidences $c_{kl}$ over all KNN regions satisfying $l \in \phi(g_k)$. If the confidence is greater than a confidence threshold, the label $l$ is predicted as visible in the input image; otherwise it is predicted as invisible. Secondly, we estimate the locations of the visible labels. More specifically, the coordinates of the matching region in the input image $I$ are calculated based on the matching displacements $t_{kl}$ and the ground-truth coordinates of $g_{kl}$. Then, we morph the associated ground-truth label mask of $g_{kl}$ into the matched region in $I$. In this way, we get a probability map $M_l$ of $I$ for each label $l \in [1, L]$. We take the pixel-wise maximum over all $M_l$ to get the foreground probability. The pixels with a probability larger than a foreground threshold are regarded as the rough foreground, while the remaining pixels form the rough background. The rough foreground and background are further eroded with a filter of size 10 to produce the final foreground and background seeds. Based on the seeds, we obtain the background probability with an existing seed-based algorithm. The obtained background probability is combined with the foreground probability maps $M_l$, $l \in [1, L]$, based on which we get an initial human parsing result with the pixel-wise Maximum a Posteriori (MAP) assignment. Finally, to respect the boundaries of actual semantic labels, we further over-segment $I$ using the entropy rate based segmentation algorithm and assign to each superpixel the majority label of the initial parsing results of its pixels.
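Three of the post processing steps lend themselves to a short sketch: deciding whether a label is present, locating the matched region from a predicted displacement, and smoothing an initial parsing with superpixels. The code below is an illustration under stated assumptions rather than the original implementation: the displacement is taken to be "matched box minus KNN region box" in normalized coordinates (the sign convention is not spelled out in the text), labels are non-negative integers, and the default threshold of 0.8 is the first of the two values reported in Sec. 4.1.

```python
import numpy as np

def label_is_present(confidences, threshold=0.8):
    """Average the matching confidences of the KNN regions that carry a
    label and compare the mean against the confidence threshold."""
    return float(np.mean(confidences)) > threshold

def matched_box(region_box, displacement):
    """Shift a normalized [x1, y1, x2, y2] KNN region box by a 4-dim
    displacement (assumed convention: matched box = region box + t)."""
    box = np.asarray(region_box, dtype=float) + np.asarray(displacement, dtype=float)
    return np.clip(box, 0.0, 1.0)

def superpixel_majority(labels, superpixels):
    """Give every superpixel the most frequent initial label of its pixels.

    labels:      (H, W) integer array with the initial per-pixel parsing.
    superpixels: (H, W) integer array with a superpixel id per pixel.
    """
    out = np.empty_like(labels)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        out[mask] = np.bincount(labels[mask]).argmax()
    return out
```

The morphing of label masks, the seed erosion and the background estimation are omitted here because they depend on components that are only cited in the text.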
4. Experiments

4.1. Experimental Settings 

Datasets: We use a dataset that is labeled pixel-wise with the 18 categories defined by the Daily Photos dataset [14]. The dataset contains 7,700 images (6,000 for training, 700 for validation and 1,000 for testing). We adopt four evaluation metrics, i.e., accuracy, average precision, average recall, and average F1 score over pixels.
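For readers who want to reproduce these pixel-level metrics, the following sketch computes the overall accuracy together with precision, recall and F1 averaged over labels. How the averages treat labels that never occur, and whether the background label is included in them, is not stated in the text, so the choices below (absent labels contribute zero, all labels are averaged) are assumptions.

```python
import numpy as np

def pixel_metrics(pred, gt, num_labels):
    """Return (accuracy, avg precision, avg recall, avg F1) over pixels."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    accuracy = float(np.mean(pred == gt))
    precisions, recalls, f1s = [], [], []
    for label in range(num_labels):
        tp = np.sum((pred == label) & (gt == label))
        fp = np.sum((pred == label) & (gt != label))
        fn = np.sum((pred != label) & (gt == label))
        p = tp / (tp + fp) if tp + fp > 0 else 0.0
        r = tp / (tp + fn) if tp + fn > 0 else 0.0
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return accuracy, float(np.mean(precisions)), float(np.mean(recalls)), float(np.mean(f1s))
```

Because the F1 score is averaged per label, it is generally not equal to the F1 score computed from the averaged precision and recall.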

Training Image Pairs Generation: To reduce over-fitting in the model training and partially address detection errors, we enlarge the cropped human centric image region by 1 and 1.2 times. We also horizontally mirror the images. In short, each image has 4 variations and the training data can be greatly augmented.

For each of the 6,000 training images, 50 KNN images are retrieved from the image corpus. Each training image and each KNN region form a training pair. After unevenly sampling the training pairs to balance different labels, we finally have 5 million pairs, which even outnumbers that of ILSVRC2012. We shuffle the training pairs in order to increase the diversity of each epoch.
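The four variations per image (two crop scales, each with and without mirroring) can be enumerated with a small helper. The sketch assumes that the detection box is enlarged around its centre and clipped to the image, which the text does not specify; the function names are hypothetical, and the mirroring itself is only recorded as a flag.

```python
def enlarge_box(box, scale, img_w, img_h):
    """Scale an (x1, y1, x2, y2) box around its centre and clip to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(img_w), cx + half_w), min(float(img_h), cy + half_h))

def crop_variants(box, img_w, img_h, scales=(1.0, 1.2)):
    """List the (crop box, mirrored) combinations: 2 scales x 2 flips = 4."""
    return [(enlarge_box(box, s, img_w, img_h), mirrored)
            for s in scales for mirrored in (False, True)]

variants = crop_variants((10, 20, 50, 100), img_w=200, img_h=200)
```

Each returned box would be cropped from the image, flipped if requested, and rescaled to the network input size before being paired with KNN regions.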

Implementation Details: We implement the M-CNN under the Caffe framework and train it using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We use an equal learning rate for all layers. The learning rate is initialized at 0.0005 and is adjusted manually by dividing it by 10 when the validation error rate stops decreasing with the current learning rate. We train M-CNN for roughly 50 epochs, which takes 11 to 12 days on one NVIDIA GTX TITAN 6GB GPU. In the training phase, we first calculate the element-wise mean and variance of the matching confidence and displacements over the whole image corpus, and normalize the training outputs element-wise by the mean and variance. In the testing phase, we map the matching confidence and displacements estimated by M-CNN back to their original scale using the mean and variance. In the post processing step, the confidence threshold and the foreground threshold are set to 0.8 and 0.5, respectively. The number of KNN regions for each input image is set to 9.
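The normalization of the regression targets can be sketched as follows. The text states only that the outputs are normalized "by the mean and variance"; the snippet assumes the common standardization (subtract the mean, divide by the standard deviation) and inverts it at test time. The small constant `eps` and the function names are additions for illustration.

```python
import numpy as np

def fit_target_stats(targets, eps=1e-8):
    """Element-wise mean and standard deviation of the 5-dim targets
    (1 confidence + 4 displacements), computed over all training pairs."""
    mean = targets.mean(axis=0)
    std = targets.std(axis=0) + eps  # eps guards against zero variance
    return mean, std

def normalize(targets, mean, std):
    """Standardize training targets before regression."""
    return (targets - mean) / std

def denormalize(predictions, mean, std):
    """Map network outputs back to the original target scale at test time."""
    return predictions * std + mean
```

Standardizing the targets keeps the confidence and displacement dimensions on comparable scales, so that neither dominates the squared-error loss.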

4.2. Results and Analysis 

Comparison with the State of the Art: We compare our M-CNN based quasi-parametric human parsing framework with two state-of-the-art methods: Yamaguchi et al. [29] and PaperDoll. We use their publicly available code and train their models with the same 6,000 training images as our method for fair comparison. We do not compare with Dong et al. because their code is not publicly available and their method is reported to be slower than ours.

The average results for all labels are in Table 1. The methods of Yamaguchi et al. [29] and PaperDoll are trained on the same 6,000 training images and tested on the same 1,000 images as M-CNN, and their average F1 scores reach 41.80% and 44.76%. Our “M-CNN” achieves 62.81% and thus significantly outperforms these two baselines, by 21.01 percentage points over Yamaguchi et al. [29] and by 18.05 percentage points over PaperDoll. “M-CNN” also gives a large boost in foreground accuracy: the two baselines achieve 55.59% for Yamaguchi et al. [29] and 62.18% for PaperDoll, while “M-CNN” obtains 73.98%. “M-CNN” also obtains much higher precision (64.56% vs 37.54% for [29] and 52.75% for PaperDoll) as well as higher recall (65.17% vs 51.05% for [29] and 49.43% for PaperDoll). This verifies the effectiveness of our end-to-end M-CNN based quasi-parametric framework.

We also present the F1 scores for each label in Table 2. Generally, “M-CNN” shows much higher performance than the baselines. In terms of predicting labels for small semantic regions such as hat, belt, bag and scarf, our method achieves a large gain, e.g., 43.38% vs 11.43% for [29] and 2.95% for PaperDoll on scarf, 57.87% vs 24.53% for [29] and 30.52% for PaperDoll on bag, and 38.45% vs 14.68% for [29] and 16.94% for PaperDoll on belt. This demonstrates that our quasi-parametric network can effectively capture the internal relations between the labels and robustly predict the label masks under various clothing styles and poses.

Ablation of Our Networks: We also extensively explore different CNN architectures to demonstrate the effectiveness of each component in M-CNN more transparently. The architecture of M-CNN is shown in Fig. 2, and other variants are constructed by gradually adding/eliminating the cross image filters in different layers. M-CNN contains cross image matching filters in 4 layers, from conv2 to conv5. “M-CNN (cross 5,4,3,2,1)” is obtained by adding 11x11x6 cross image filters in the “conv1” layer to the “M-CNN”. The added matching filters are applied on the stacked image composed of the RGB channels of the input image and the KNN region. In addition, we remove the cross image matching filters layer by layer, producing “M-CNN (cross 5,4,3)”, “M-CNN (cross 5,4)”, “M-CNN (cross 5)” and “M-CNN (w/o cross)”. Note that no matching filters are used in the “M-CNN (w/o cross)” architecture, where “w/o” stands for without. For fair comparison, we keep the number of feature maps of each convolutional layer unchanged for different M-CNN variations. Therefore, the number of removed cross image filters is evenly added to the corresponding two single image layers. For example, the numbers of feature maps for both the single and cross image convolutional paths are all 30 in “conv2”. After we remove the cross matching filters to derive “M-CNN (cross 5,4,3)”, “M-CNN (cross 5,4)” and “M-CNN (cross 5)”, the number of feature maps in the single image convolutional path is set to 45. In addition, we compare with a classic CNN based verification architecture named “Siamese”, which is a composite structure of two identical sub-networks. The
Table 1. Comparison of parsing performances with several architectural variants of our model (cross image matching filters embedded into different convolutional layers, with and without superpixel smoothing) and two state-of-the-art methods.

Method                     Accuracy   F.g. accuracy   Avg. precision   Avg. recall   Avg. F1 score
Yamaguchi et al. [29]      84.38      55.59           37.54            51.05         41.80
PaperDoll                  88.96      62.18           52.75            49.43         44.76
Siamese                    85.24      56.42           50.27            48.88         47.08
M-CNN (w/o cross)          88.30      69.84           58.63            59.52         56.99
M-CNN (cross 5)            88.62      69.88           60.89            60.47         58.07
M-CNN (cross 5,4)          89.41      72.44           58.93            63.16         60.03
M-CNN (cross 5,4,3)        88.97      70.84           60.27            62.23         60.36
M-CNN                      89.57      73.98           64.56            65.17         62.81
M-CNN (cross 5,4,3,2,1)    89.42      71.86           63.13            63.49         61.53
M-CNN (w/o ss)             87.08      71.73           55.88            65.32         59.39


Table 2. F1 scores of foreground semantic labels. Comparison of F1 scores with several architectural variants of our model and two state-of-the-art methods.

Method                     Hat     Hair    S-gls   U-cloth  Skirt   Pants   Dress   Belt    L-shoe  R-shoe  Face    L-leg   R-leg   L-arm   R-arm   Bag     Scarf
Yamaguchi et al. [29]      8.44    59.96   12.09   56.07    17.57   55.42   40.94   14.68   38.24   38.33   72.1    58.52   57.03   45.33   46.65   24.53   11.43
PaperDoll                  1.72    63.58   0.23    71.87    40.2    69.35   59.49   16.94   45.79   44.47   61.63   52.19   55.6    45.23   46.75   30.52   2.95
Siamese                    74.02   44.70   11.78   65.69    74.69   61.52   70.30   1.85    37.12   34.16   39.36   50.24   51.65   32.54   30.59   24.23   39.37
M-CNN (w/o cross)          77.67   63.11   29.68   72.26    73.66   70.92   78.79   25.51   43.25   44.53   67.99   60.90   64.90   47.35   39.78   41.14   40.06
M-CNN (cross 5)            67.61   65.30   38.51   69.06    73.95   69.33   81.88   28.41   46.78   41.01   70.82   59.49   66.28   52.61   39.73   40.89   46.01
M-CNN (cross 5,4)          76.42   65.91   46.97   71.51    74.29   68.43   82.16   41.31   41.83   43.03   73.27   59.35   62.18   52.63   54.04   46.25   40.58
M-CNN (cross 5,4,3)        79.38   67.64   34.50   70.72    76.57   69.17   83.81   25.69   54.82   44.61   75.22   63.44   67.74   54.94   48.59   40.10   46.40
M-CNN                      80.77   65.31   35.55   72.58    77.86   70.71   81.44   38.45   53.87   48.57   72.78   63.25   68.24   57.40   51.12   57.87   43.38
M-CNN (cross 5,4,3,2,1)    77.34   65.56   45.19   70.64    78.64   69.95   82.72   46.52   45.72   44.45   71.27   61.59   63.49   53.23   50.72   55.96   45.48
M-CNN (w/o ss)             72.26   61.32   49.89   68.74    74.80   66.41   78.24   40.13   47.86   43.86   67.66   51.61   59.37   50.34   44.38   50.78   46.04

outputs of the two sub-networks are fully-connected linear
layer responses, whose absolute differences are used
as features to fit the final matching confidence and
displacements. Finally, we compare with “M-CNN (w/o ss)”, which is
the same as M-CNN except that the superpixel smoothing
step is skipped.
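
As an illustration of the “Siamese” baseline described above, the following is a minimal sketch of how absolute differences of two fully-connected responses could serve as features for a regressor. The feature size (128), the output layout (one confidence value plus four displacement values), the untrained random weights and the function names `siamese_features` and `linear_head` are all assumptions made for this example; they are not taken from the compared architecture.

```python
import numpy as np


def siamese_features(fc_test: np.ndarray, fc_region: np.ndarray) -> np.ndarray:
    """Absolute difference of two fully-connected responses (one per sub-network)."""
    if fc_test.shape != fc_region.shape:
        raise ValueError("both sub-networks must produce responses of the same shape")
    return np.abs(fc_test - fc_region)


def linear_head(features: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Hypothetical linear regressor fitted on the difference features."""
    return features @ weights + bias


rng = np.random.default_rng(0)
FEATURE_DIM = 128  # assumed size of the fully-connected response
N_TARGETS = 5      # assumed layout: 1 confidence + 4 displacement values

# Stand-ins for the responses of the two identical sub-networks.
fc_test = rng.normal(size=FEATURE_DIM)    # testing image
fc_region = rng.normal(size=FEATURE_DIM)  # semantic region of a KNN image
features = siamese_features(fc_test, fc_region)

# Untrained placeholder parameters; a real model would learn these.
weights = rng.normal(scale=0.01, size=(FEATURE_DIM, N_TARGETS))
bias = np.zeros(N_TARGETS)
prediction = linear_head(features, weights, bias)
print(prediction.shape)
```

Note that the difference vector is symmetric in its two inputs and has no spatial axes, so any displacement must be inferred from a representation that no longer carries an explicit image layout.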

The average scores over all labels in Table 1 offer the
following observations. Starting from “M-CNN (w/o cross)”,
incrementally adding cross image matching filters to more
convolutional layers produces four variants: “M-CNN (cross 5)”,
“M-CNN (cross 5,4)”, “M-CNN (cross 5,4,3)” and “M-CNN”.
The average F1 score rises from 56.99% to 58.07%, 60.03%,
60.36% and finally 62.81%, the highest score, which is
reached by “M-CNN”. This gradual improvement indicates that
inserting cross image matching filters into multiple
convolutional layers helps achieve better matching. However,
additionally placing a cross image matching kernel in the
“conv1” layer lowers the F1 score from 62.81% to 61.53%.
The reason is that the receptive field of this first cross
image matching kernel is small and covers only part of a
semantic region, whereas the targets of this work, i.e.,
the matching confidence and displacements, are defined at
the semantic label level and thus lie beyond the receptive
field of a kernel inserted in the “conv1” layer. As the
M-CNN grows deeper, the receptive fields become much larger
and are more likely to cover the whole semantic region,
which facilitates estimating the label-level displacements.
In addition, “M-CNN (w/o cross)” performs better than
“Siamese”. Since our ultimate task is to estimate
displacements, spatial structure matters: “M-CNN (w/o cross)”
computes the difference between two “conv5” layer feature
maps, while “Siamese” computes the difference of two fully
connected features, in which the spatial structure of the
2-D images is partially lost. The lower F1 score of
“M-CNN (w/o ss)” compared with “M-CNN” (59.39% vs. 62.81%)
indicates that the adopted superpixel smoothing better
preserves boundary information, although it is only a simple
and fast voting over pixels’ labels. The higher F1 score
of “M-CNN (w/o ss)” compared with the state-of-the-art
methods [29, 28] (59.39% vs. 41.80% and 44.76%) shows that
M-CNN can directly predict more reliable label masks even
without the post-processing step.
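
The superpixel smoothing mentioned above is described only as a simple and fast voting of pixels’ labels. A minimal sketch of such a vote is given below, assuming that a superpixel map is already available from some over-segmentation method and that every pixel in a superpixel receives the most frequent predicted label inside it. The function name `superpixel_vote` and the tie-breaking rule (the smallest label index wins) are assumptions of this example, not details from the method.

```python
import numpy as np


def superpixel_vote(labels: np.ndarray, superpixels: np.ndarray) -> np.ndarray:
    """Give every pixel the most frequent label found inside its superpixel.

    labels:      integer label map (non-negative), shape (H, W)
    superpixels: integer superpixel ids, same shape as labels
    """
    if labels.shape != superpixels.shape:
        raise ValueError("labels and superpixels must have the same shape")

    n_labels = int(labels.max()) + 1
    # Map arbitrary superpixel ids to consecutive indices 0..S-1.
    _, sp_index = np.unique(superpixels.ravel(), return_inverse=True)
    n_superpixels = int(sp_index.max()) + 1

    # votes[s, c] = number of pixels in superpixel s predicted as label c.
    votes = np.zeros((n_superpixels, n_labels), dtype=np.int64)
    np.add.at(votes, (sp_index, labels.ravel()), 1)

    # argmax returns the first maximum, so ties go to the smallest label index.
    winners = votes.argmax(axis=1)
    return winners[sp_index].reshape(labels.shape)


# Toy 4x4 example with four 2x2 superpixels and a few noisy pixels.
superpixels = np.array([[0, 0, 1, 1],
                        [0, 0, 1, 1],
                        [2, 2, 3, 3],
                        [2, 2, 3, 3]])
labels = np.array([[1, 1, 2, 2],
                   [1, 0, 2, 3],
                   [0, 0, 3, 3],
                   [0, 0, 3, 2]])
smoothed = superpixel_vote(labels, superpixels)
print(smoothed)
```

In the toy example the isolated labels at positions (1, 1), (1, 3) and (3, 3) are overruled by the majority of their superpixel, so each predicted region snaps to the superpixel boundaries.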

Sensitivity to the Number of K: In Fig. 5 we report
the performance of our human parsing method with respect
to different numbers of KNN images. We find that M-CNN
reaches the highest F1 score, 63.58%, when 9 KNN regions
are considered. When only 1 KNN region is considered, the
performance is still quite competitive (56.92%).
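
For reference, the evaluation measures quoted in this section can be computed from a predicted and a ground-truth label map as sketched below. This is a sketch under stated assumptions: the averages are taken as unweighted means over labels, foreground accuracy is computed over pixels whose ground-truth label is not background, and labels that never occur are scored as 0. The function name `parsing_metrics` and the background index 0 are hypothetical, and the exact evaluation protocol used for the tables may differ.

```python
import numpy as np


def parsing_metrics(pred, gt, n_labels: int, background: int = 0) -> dict:
    """Pixel accuracy, foreground accuracy and label-averaged precision/recall/F1."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()

    # confusion[i, j] = number of pixels with ground truth i predicted as j.
    confusion = np.bincount(gt * n_labels + pred, minlength=n_labels ** 2)
    confusion = confusion.reshape(n_labels, n_labels)

    true_pos = np.diag(confusion).astype(float)
    predicted = confusion.sum(axis=0)  # pixels predicted as each label
    actual = confusion.sum(axis=1)     # pixels truly belonging to each label

    # Labels with an empty denominator keep the default value 0.
    precision = np.divide(true_pos, predicted, out=np.zeros(n_labels), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=np.zeros(n_labels), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros(n_labels), where=denom > 0)

    foreground = gt != background
    return {
        "accuracy": float((pred == gt).mean()),
        "fg_accuracy": float((pred[foreground] == gt[foreground]).mean()) if foreground.any() else 0.0,
        "avg_precision": float(precision.mean()),
        "avg_recall": float(recall.mean()),
        "avg_f1": float(f1.mean()),
        "f1_per_label": f1,
    }


# Toy example with three labels (0 = background).
gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
metrics = parsing_metrics(pred, gt, n_labels=3)
print(metrics)
```

Because the average F1 score is a mean of per-label F1 scores, it is generally not the harmonic mean of the average precision and the average recall.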

Qualitative Parsing Results Comparison: Fig. 4
shows the comparison between M-CNN and PaperDoll [28].
The results demonstrate that our method can successfully
predict label maps containing small regions, which
can be attributed to the reliable label transfer from the
KNN regions. For example, the bag, scarf and hat in three
images in the first column are successfully located by
M-CNN but totally missed by PaperDoll. Another example is
that M-CNN successfully finds the small sunglasses in the
first row, which are missed by PaperDoll. In addition, it can
be observed that the results from M-CNN are robust to pose
variations. As shown in the bottom row, M-CNN can accurately
estimate the locations of the left and right arms, while
PaperDoll cannot. We attribute this to the fact that our
method is an end-to-end model, while PaperDoll relies on
a separate pose estimation preprocessing step. Another
observation from Fig. 4 is that our segmented regions are more
complete while the PaperDoll regions are fragmented, as in
the lower right result. That is because PaperDoll transfers
labels based on oversegments, which lack explicit semantic
meaning.

Figure 4. Comparison of our parsing results with the PaperDoll method. For each image, we show the testing image, the parsing result by
PaperDoll [28] and the result of our “M-CNN”, in that order. Legend labels: Scarf, Belt, Bag, R-arm, L-arm, Dress, Pants, Skirt, R-leg, L-leg, Face, R-shoe, L-shoe, Upper, Glass, Hair, Hat, Null.

Figure 5. F1 scores for different numbers of KNN images.


5. Conclusion and Future Work 

In this work, we tackle the human parsing problem by
proposing a new quasi-parametric model. Our unified
end-to-end quasi-parametric framework inherits the merits of
both parametric and non-parametric parsing methodologies.
It takes advantage of the supervision from annotated data
and can be easily extended to newly annotated images and
labels. To characterize the multi-ranged matching, we propose
a Matching Convolutional Neural Network, which
contains two single image convolutional paths for better
feature representation and a cross image convolutional path
where cross image matching filters are embedded into the
convolutional layers. Extensive experimental results
demonstrate a significant performance gain of the
quasi-parametric model over the state-of-the-art methods.
In the future, we will extend the framework to other exemplar-based
tasks, such as face parsing. Moreover, we plan to use
more powerful network structures, e.g., GoogLeNet [23].